Introduction to Data Wrangling and Visualization with Python

Kuo, Yao-Jen 郭耀仁

Agenda

  • About Me
  • A Short Glimpse of a Data Science Project
  • Data Wrangling with Python
  • Data Visualization with Python

A Short Glimpse of a Data Science Project

TL; DR

Data science is a huge field, it invovles applications and tools like importing, tidying, transformation, visualization, modeling, and communication. Surrounding all these is programming.

Source: R for Data Science

A typical data science project is a pipeline looks like

Imgur

Source: R for Data Science

How does Python tackle all these applications?

Source: https://giphy.com/

PEP 206(Python Enhancement Proposals 206): Python Advanced Library plays the key role

The Python source distribution has long maintained the philosophy of "batteries included": having a rich and versatile standard library which is immediately available

  • random
  • datetime
  • json
  • ...etc.

However, the standard library modules aren't always the best choices for a specific application

A list of third-party modules that make Python the go-to choice for most of the data science teams

  • requests
  • numpy
  • pandas
  • matplotlib
  • scikit-learn
  • scipy
  • tensorflow
  • pytorch
  • flask
  • dash
  • ...etc.

Import: This typically means that we take data stored in a file, database, or web API, and load it into a data structure in Python

  • csv
  • json
  • numpy
  • pandas
  • requests
  • beautifulsoup4
  • lxml
  • sqlite3

Tidy: Storing data in a consistent form that matches the semantics of the dataset. Tidy data is important because the more consistent the easier for us to get it into the right form for different functions

  • numpy
  • pandas
  • Functional programming techniques

Transform: For instance narrowing in on observations of interest, creating new variables that are functions of existing variables, and calculating a set of summary statistics

  • numpy
  • pandas
  • Functional programming techniques

Together, tidying and transforming are called WRANGLING!

Source: https://giphy.com/

Visualize: A fundamental human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data.

  • pandas
  • matplotlib
  • seaborn

Model: Once we've made questions sufficiently precise, we can use a mathematical or computational tool to answer them

  • scikit-learn
  • scipy
  • tensorflow
  • pytorch

Communicate: It doesn’t matter how well our project is unless we can communicate the results to others

  • jupyter
  • dash

Communication matters

It should be possible to explain the laws of physics to a barmaid.

Source: Einstein: His Life and Times

Data Wrangling with Python

TL; DR

We brief the what and why on pandas and a quickstart of common DataFrame operations with 91APP dataset.

What is pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.

Source: https://github.com/pandas-dev/pandas

Why pandas

Python has always been a popular language for web scraping, however there is a weak spot in its analysis capability. Python used to switch to a more data-centric language like R or Matlab during the analysis stage.

Quick question: how does panda relate to data?

Source: https://giphy.com/

Turns out it has nothing to do with panda...

pandas refers to three primary class customed by its author Wes McKinney

  • Panel
  • DataFrame
  • Series

Data analysts mainly deal with Index, ndarray, Series, DataFrame

  • DataFrame
    • Series
      • Index
      • ndarray

Libraries we need for the following demo

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sqlite3

Combining an Index and a ndarray can form a Series

In [2]:
movie_index = pd.Index(['The Avengers', 'Avengers: Age of Ultron', 'Avengers: Infinity War', 'Avengers: Endgame'])
movie_ratings = np.array([8, 7.3, 8.5, 8.4])
print(type(movie_index))
print(type(movie_ratings))
<class 'pandas.core.indexes.base.Index'>
<class 'numpy.ndarray'>
In [3]:
rating_series = pd.Series(movie_ratings, movie_index)
print(type(rating_series))
print(rating_series)
<class 'pandas.core.series.Series'>
The Avengers               8.0
Avengers: Age of Ultron    7.3
Avengers: Infinity War     8.5
Avengers: Endgame          8.4
dtype: float64

And vice versa...

In [4]:
rating_series = pd.Series([8, 7.3, 8.5, 8.4], ['The Avengers', 'Avengers: Age of Ultron', 'Avengers: Infinity War', 'Avengers: Endgame'])
print(type(rating_series))
print(type(rating_series.index))
print(type(rating_series.values))
<class 'pandas.core.series.Series'>
<class 'pandas.core.indexes.base.Index'>
<class 'numpy.ndarray'>

Combining multiple Series with the same Index can form a DataFrame

In [5]:
release_year_series = pd.Series([2012, 2015, 2018, 2019], ['The Avengers', 'Avengers: Age of Ultron', 'Avengers: Infinity War', 'Avengers: Endgame'])
movie_df = pd.DataFrame()
movie_df['rating'] = rating_series
movie_df['release_year'] = release_year_series
print(type(movie_df))
movie_df
<class 'pandas.core.frame.DataFrame'>
Out[5]:
rating release_year
The Avengers 8.0 2012
Avengers: Age of Ultron 7.3 2015
Avengers: Infinity War 8.5 2018
Avengers: Endgame 8.4 2019

And vice versa...

In [6]:
print(type(movie_df['rating']))
print(type(movie_df['rating'].index))
<class 'pandas.core.series.Series'>
<class 'pandas.core.indexes.base.Index'>

Main pandas functionalities which data analysts rely on are

  • Import
  • Wrangling(Tidy + Transform)
  • Visualize

pandas has massive functions importing tabular data

  • Flat text file
  • Database table
  • Spreadsheet
  • Array of JSONs
  • ...etc.

Source: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

Flat text files: 91APP Dataset

  • BehaviorDataForNTU.csv
  • MemberDataForNTU.csv
  • OrderDataForNTU.csv
  • OrderSlaveDataForNTU.txt

Source: NTU資料分析與決策社

Check file encoding before importing

  • Use text editor Notepad++
  • Use command line file --mime-encoding <filename>
    • macOS/Linux: Terminal
    • Windows: Git Bash
In [7]:
!file --mime-encoding *.csv
!file --mime-encoding *.txt
BehaviorDataForNTU.csv: utf-8
MemberDataForNTU.csv:   us-ascii
OrderDataForNTU.csv:    us-ascii
OrderSlaveDataForNTU.txt: us-ascii

Importing using pandas read_csv functions

In [8]:
# for example
order_slave = pd.read_csv('OrderSlaveDataForNTU.txt')
/Users/kuoyaojen/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3063: DtypeWarning: Columns (5,6) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [9]:
order_slave.shape
Out[9]:
(1159830, 15)
In [10]:
order_slave.dtypes
Out[10]:
UUID                    object
TransactionNum          object
TradesDateTime          object
ChannelType             object
ChannelDetail           object
PaymentType             object
ShippingType            object
OuterProductSkuCode     object
ProductSkuCode         float64
SalePageCode           float64
Qty                      int64
TotalSalesAmount         int64
TotalPrice               int64
TotalDiscount            int64
Status                  object
dtype: object

Assigning data types if there is a warning message

In [11]:
dtypes = {
    'PaymentType': 'str',
    'ShippingType': 'str'
}
order_slave = pd.read_csv('OrderSlaveDataForNTU.txt', dtype=dtypes)

Why DtypeWarning?

Pandas is built on numpy, numpy records missing values with np.NaN which is a float type.

In [12]:
print(type(np.NaN))
<class 'float'>

Importing other files as well

In [13]:
dtypes = {
    'CategoryId': 'str',
    'SearchKeyWord': 'str',
    'TransactionNum': 'str'
}
behavior = pd.read_csv('BehaviorDataForNTU.csv', dtype=dtypes)
member = pd.read_csv('MemberDataForNTU.csv')
order_main = pd.read_csv('OrderDataForNTU.csv')

Using shape attribute to check layout

In [14]:
print(behavior.shape)
print(member.shape)
print(order_main.shape)
print(order_slave.shape)
(15081048, 15)
(159835, 11)
(508083, 18)
(1159830, 15)

Using head method for a peek

In [15]:
behavior.head()
Out[15]:
VisitorId OperationSystem SessionNumber TrafficSourceCategory HitDateTime BehaviorType SourceType OnlineMemberId CategoryId SearchKeyWord TransactionNum ProductPrice ProductQuantity ProductId TransactionRevenue
0 d960a83d-f413-4659-813c-6cd1610a1292 iOS 3.0 Direct 2018-05-16 23:14:31.500 Purchase APP 2783844908 NaN NaN 1e90QjPFvmpWFROtMl0nBA9w3dD= 1380.0 1.0 4298287.0 4330.0
1 43a28e13-df38-45a3-6038-f676b0ac61ba Windows 1.0 Direct 2018-12-27 20:52:30.428 Purchase WEB 3583134573 NaN NaN RRL2Ah/8qPfZDdARqQ/cT2gSO0L= 1447.0 1.0 4899494.0 4620.0
2 cae70e93-fec0-4c52-38a5-da5c0c77e976 Android 1.0 Direct 2018-06-07 06:17:15.701 Purchase WEB 3384591959 NaN NaN wdLh1oFx6lNjV1ENQtBVFglDVcD= 1280.0 1.0 4305841.0 2460.0
3 e903d188-cf9d-423a-b058-93d0c882ba2d Android 1.0 Facebook 2018-05-09 12:06:41.670 Purchase WEB 47808784R2 NaN NaN csoMr0O2AdKUG2M3k9m5HjwymtZ= 1280.0 1.0 4259098.0 2860.0
4 05d9c7be-ad65-48d6-17c7-2d2ddb14ab37 Android 1.0 Facebook 2018-03-12 07:34:54.457 Purchase WEB 3483273161 NaN NaN A75CPoWttFCvO3/M5NQxH51LLtb= 690.0 1.0 4167807.0 2170.0
In [16]:
member.head()
Out[16]:
UUID OnlineMemberId RegisterSourceTypeDef RegisterDate Gender Birthday IsAppInstalled IsEnableEmail IsEnableShortMessage IsEnablePushNotification MemberCardLevel
0 X03665298 NaN Store 2011/12/4 Female 1965/1/10 False True True True 10
1 T04434917 NaN Store 2011/12/19 Female 1900/1/1 False True True True 10
2 T03441135 NaN Store 2011/12/19 Female 1983/1/18 False True True True 20
3 W03686173 NaN Store 2011/12/28 Female 1980/1/1 False True True True 10
4 W03441148 NaN Store 2011/12/19 Female 1900/1/1 False True True True 10
In [17]:
order_main.head()
Out[17]:
UUID TransactionNum TradesDate ChannelType ChannelDetail PaymentType ShippingType BeforeMemberTierLevel AfterMemberTierLevel TsCount Qty TotalSalesAmount TotalPrice TotalDiscount TotalPromotionDiscount TotalCouponDiscount TotalLoyaltyDiscount Status
0 U2131156 84GbV1FZLS+u1nu1FgX7AxmHsab= 2018/7/10 Pos A039 Cash Store 0 0 1 1 2380 2380 0 0 0 0 Finish
1 V02663974 g3sJev+nMFHCc7J+3MIkXaeA3wV= 2018/7/3 Pos A042 Cash Store 0 0 1 1 390 390 0 0 0 0 Finish
2 R03555577 jajyDU5zj3+0NFjBDzIPiijAdDD= 2018/7/4 Pos A050 Cash Store 0 0 1 1 1480 1480 0 0 0 0 Finish
3 O07191104 W5qdVNdgnsWEx8a7ikzHw/KbGI6= 2018/7/7 Pos A051 Cash Store 0 0 1 1 1380 1380 0 0 0 0 Finish
4 U99462295 zVBGf/pLjTKHQFQUEQ5zUTvS8Yp= 2018/7/8 Pos A054 Cash Store 0 0 1 1 1280 1280 0 0 0 0 Finish
In [18]:
order_slave.head()
Out[18]:
UUID TransactionNum TradesDateTime ChannelType ChannelDetail PaymentType ShippingType OuterProductSkuCode ProductSkuCode SalePageCode Qty TotalSalesAmount TotalPrice TotalDiscount Status
0 T04536285 o2MqkS6y1AR0o4shLXAyZKye9Ct= 2016-04-01 18:07:20 Pos C030 NaN NaN 746140050903 NaN NaN 1 0 0 0 Finish
1 T04536285 o2MqkS6y1AR0o4shLXAyZKye9Ct= 2016-04-01 18:07:20 Pos C030 NaN NaN 746140050902 NaN NaN -1 0 0 0 Return
2 P1674651 gfsCr+82rsF9Qm1sU5KtXiKwlnN= 2016-04-03 16:10:27 Pos A057 NaN NaN 935200275002 NaN NaN 1 0 0 0 Finish
3 P1674651 gfsCr+82rsF9Qm1sU5KtXiKwlnN= 2016-04-03 16:10:27 Pos A057 NaN NaN 935200275001 NaN NaN -1 0 0 0 Return
4 N00537019 wEHcPVo+2/Jzi1Lj5RfCeG5JD+6= 2016-04-03 18:23:45 Pos A057 NaN NaN 116110010002 NaN NaN -1 0 0 0 Return

ntu_dac_91APP.db

In [19]:
conn = sqlite3.connect('ntu_dac_91APP.db')
query_str = """
SELECT *
    FROM order_slave
    LIMIT 5;
"""
pd.read_sql(query_str, conn)
Out[19]:
UUID TransactionNum TradesDateTime ChannelType ChannelDetail PaymentType ShippingType OuterProductSkuCode ProductSkuCode SalePageCode Qty TotalSalesAmount TotalPrice TotalDiscount Status
0 T04536285 o2MqkS6y1AR0o4shLXAyZKye9Ct= 2016-04-01 18:07:20 Pos C030 None None 7.461401e+11 None None 1 0 0 0 Finish
1 T04536285 o2MqkS6y1AR0o4shLXAyZKye9Ct= 2016-04-01 18:07:20 Pos C030 None None 7.461401e+11 None None -1 0 0 0 Return
2 P1674651 gfsCr+82rsF9Qm1sU5KtXiKwlnN= 2016-04-03 16:10:27 Pos A057 None None 9.352003e+11 None None 1 0 0 0 Finish
3 P1674651 gfsCr+82rsF9Qm1sU5KtXiKwlnN= 2016-04-03 16:10:27 Pos A057 None None 9.352003e+11 None None -1 0 0 0 Return
4 N00537019 wEHcPVo+2/Jzi1Lj5RfCeG5JD+6= 2016-04-03 18:23:45 Pos A057 None None 1.161100e+11 None None -1 0 0 0 Return

Once we've done importing dataset, there are some common attributes/methods to examine a DataFrame object

  • shape
  • info()
  • head()/tail()
  • describe()
In [20]:
order_main.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 508083 entries, 0 to 508082
Data columns (total 18 columns):
UUID                      508083 non-null object
TransactionNum            508083 non-null object
TradesDate                508083 non-null object
ChannelType               508083 non-null object
ChannelDetail             508083 non-null object
PaymentType               508083 non-null object
ShippingType              508083 non-null object
BeforeMemberTierLevel     508083 non-null int64
AfterMemberTierLevel      508083 non-null int64
TsCount                   508083 non-null int64
Qty                       508083 non-null int64
TotalSalesAmount          508083 non-null int64
TotalPrice                508083 non-null int64
TotalDiscount             508083 non-null int64
TotalPromotionDiscount    508083 non-null int64
TotalCouponDiscount       508083 non-null int64
TotalLoyaltyDiscount      508083 non-null int64
Status                    508083 non-null object
dtypes: int64(10), object(8)
memory usage: 69.8+ MB
In [21]:
order_main.describe()
Out[21]:
BeforeMemberTierLevel AfterMemberTierLevel TsCount Qty TotalSalesAmount TotalPrice TotalDiscount TotalPromotionDiscount TotalCouponDiscount TotalLoyaltyDiscount
count 508083.0 508083.0 508083.000000 508083.000000 508083.000000 508083.000000 508083.000000 508083.000000 508083.000000 508083.0
mean 0.0 0.0 2.247958 1.898645 2642.526402 2693.586961 -51.060559 -3.998362 -47.062197 0.0
std 0.0 0.0 1.959992 2.300044 3488.535635 3502.082785 178.671054 61.354120 168.924130 0.0
min 0.0 0.0 1.000000 -36.000000 -53790.000000 -53790.000000 -9100.000000 -9100.000000 -5000.000000 0.0
25% 0.0 0.0 1.000000 1.000000 1280.000000 1280.000000 0.000000 0.000000 0.000000 0.0
50% 0.0 0.0 2.000000 1.000000 1980.000000 2070.000000 0.000000 0.000000 0.000000 0.0
75% 0.0 0.0 3.000000 3.000000 3460.000000 3560.000000 0.000000 0.000000 0.000000 0.0
max 0.0 0.0 162.000000 162.000000 255960.000000 255960.000000 1000.000000 0.000000 1000.000000 0.0

Once we've done importing and examination, we can move on to general DataFrame operations

  • Selecting
  • Filtering
  • Subsetting
  • Indexing
  • Sorting
  • Deriving
  • Summarizing
  • Summarizing and Grouping

Selecting columns from DataFrame

In [22]:
cols_to_select = ['RegisterDate', 'Gender', 'Birthday']
member[cols_to_select]
Out[22]:
RegisterDate Gender Birthday
0 2011/12/4 Female 1965/1/10
1 2011/12/19 Female 1900/1/1
2 2011/12/19 Female 1983/1/18
3 2011/12/28 Female 1980/1/1
4 2011/12/19 Female 1900/1/1
... ... ... ...
159830 2018/12/19 Female 1966/1/23
159831 2018/12/18 Female 1963/12/18
159832 2018/12/16 Female 1966/11/7
159833 2018/12/25 Female 1986/6/15
159834 2018/12/29 Female 1959/3/26

159835 rows × 3 columns

Filtering rows from DataFrame

In [23]:
rows_to_select = member['Gender'] == 'Male'
member[rows_to_select]
Out[23]:
UUID OnlineMemberId RegisterSourceTypeDef RegisterDate Gender Birthday IsAppInstalled IsEnableEmail IsEnableShortMessage IsEnablePushNotification MemberCardLevel
53 S04520222 NaN Store 2011/12/25 Male 1982/4/11 False True True True 10
108 U03583772 NaN Store 2012/1/1 Male 1981/10/16 False True True True 10
109 R03583770 NaN Store 2011/12/18 Male 1982/10/14 False True True True 10
117 W04520226 NaN Store 2011/12/25 Male 1980/11/16 False True True True 20
120 P03583767 NaN Store 2011/12/10 Male 1984/11/23 False True True True 10
... ... ... ... ... ... ... ... ... ... ... ...
159198 T96759858 45844100P0 LocationWizard 2018/12/27 Male 1974/3/29 True True True True 10
159578 T96399026 3885746594 LocationWizard 2018/12/11 Male 1976/9/25 True True True True 10
159655 S96478806 3184873530 LocationWizard 2018/12/16 Male 1992/12/16 True False False False 10
159739 O96696810 3185458431 LocationWizard 2018/12/22 Male 1971/11/8 False True True True 10
159742 N96481013 3285768749 LocationWizard 2018/12/13 Male 1960/8/20 True True True True 10

1927 rows × 11 columns

Subsetting columns and rows simultaneously from DataFrame

In [24]:
cols_to_select = ['RegisterDate', 'Gender', 'Birthday']
rows_to_select = member['Gender'] == 'Male'
member[rows_to_select][cols_to_select]
Out[24]:
RegisterDate Gender Birthday
53 2011/12/25 Male 1982/4/11
108 2012/1/1 Male 1981/10/16
109 2011/12/18 Male 1982/10/14
117 2011/12/25 Male 1980/11/16
120 2011/12/10 Male 1984/11/23
... ... ... ...
159198 2018/12/27 Male 1974/3/29
159578 2018/12/11 Male 1976/9/25
159655 2018/12/16 Male 1992/12/16
159739 2018/12/22 Male 1971/11/8
159742 2018/12/13 Male 1960/8/20

1927 rows × 3 columns

Indexing DataFrame

  • loc[]: indexing with Index
  • iloc[]: indexing with absolute positions
In [25]:
male_members = member[rows_to_select]
male_members.loc[[53, 108, 109, 117, 120]]
Out[25]:
UUID OnlineMemberId RegisterSourceTypeDef RegisterDate Gender Birthday IsAppInstalled IsEnableEmail IsEnableShortMessage IsEnablePushNotification MemberCardLevel
53 S04520222 NaN Store 2011/12/25 Male 1982/4/11 False True True True 10
108 U03583772 NaN Store 2012/1/1 Male 1981/10/16 False True True True 10
109 R03583770 NaN Store 2011/12/18 Male 1982/10/14 False True True True 10
117 W04520226 NaN Store 2011/12/25 Male 1980/11/16 False True True True 20
120 P03583767 NaN Store 2011/12/10 Male 1984/11/23 False True True True 10
In [26]:
male_members = member[rows_to_select]
male_members.iloc[[0, 1, 2, 3, 4]]
Out[26]:
UUID OnlineMemberId RegisterSourceTypeDef RegisterDate Gender Birthday IsAppInstalled IsEnableEmail IsEnableShortMessage IsEnablePushNotification MemberCardLevel
53 S04520222 NaN Store 2011/12/25 Male 1982/4/11 False True True True 10
108 U03583772 NaN Store 2012/1/1 Male 1981/10/16 False True True True 10
109 R03583770 NaN Store 2011/12/18 Male 1982/10/14 False True True True 10
117 W04520226 NaN Store 2011/12/25 Male 1980/11/16 False True True True 20
120 P03583767 NaN Store 2011/12/10 Male 1984/11/23 False True True True 10

Sorting DataFrame

  • sort_values method: sorting with a specific column
  • sort_index method: sorting with the Index of DataFrame
In [27]:
order_main.sort_values('TotalSalesAmount')
Out[27]:
UUID TransactionNum TradesDate ChannelType ChannelDetail PaymentType ShippingType BeforeMemberTierLevel AfterMemberTierLevel TsCount Qty TotalSalesAmount TotalPrice TotalDiscount TotalPromotionDiscount TotalCouponDiscount TotalLoyaltyDiscount Status
485076 X93677453 sVy7Na3OtozUpbF36AULY2AtQHN= 2018/11/15 Pos C040 Cash Store 0 0 21 -21 -53790 -53790 0 0 0 0 Return
319578 W07477115 LygZlpowSkpJ75b+yz8kh2FSj5H= 2018/10/20 Pos A062 Cash Store 0 0 17 -17 -53480 -53480 0 0 0 0 Return
391437 W05026959 Nwtrv1T4K2fMSDhUyEEwICixRhN= 2018/1/27 Pos C016 Cash Store 0 0 2 -2 -48600 -48600 0 0 0 0 Return
26575 Q04238600 aIt3Om1iHgkaEYIyW3PqppC+6Oj= 2018/12/25 Pos A040 Cash Store 0 0 7 -7 -45080 -45080 0 0 0 0 Return
467183 X94626005 0IPFjgAPUpaMKon2xaa13Wojcw6= 2019/1/14 Pos A062 Cash Store 0 0 12 -12 -39590 -39590 0 0 0 0 Return
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
467167 W09391463 jWG1/FVzp3Mxf0q4QtcXgbNV4rb= 2019/1/3 Pos B055 Cash Store 0 0 27 27 93160 93160 0 0 0 0 Finish
467081 W7108275 JlNXzMm8L4rju/V9iLB0WYYlb76= 2019/1/29 Pos A015 Cash Store 0 0 14 14 99920 99920 0 0 0 0 Finish
485054 W05580065 OvWmBv66dTEPr2UdrUkudTdyY+Z= 2018/11/20 Pos A015 Cash Store 0 0 19 19 118520 118520 0 0 0 0 Finish
391324 W07813704 LPKULMkVNK+ptb4he4ApOcvjV1R= 2018/1/17 Pos A054 Cash Store 0 0 85 85 159740 159840 -100 0 -100 0 Finish
413632 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019/4/14 Pos C019 Cash Store 0 0 162 162 255960 255960 0 0 0 0 Finish

508083 rows × 18 columns

In [28]:
male_members.sort_index(ascending=False)
Out[28]:
UUID OnlineMemberId RegisterSourceTypeDef RegisterDate Gender Birthday IsAppInstalled IsEnableEmail IsEnableShortMessage IsEnablePushNotification MemberCardLevel
159742 N96481013 3285768749 LocationWizard 2018/12/13 Male 1960/8/20 True True True True 10
159739 O96696810 3185458431 LocationWizard 2018/12/22 Male 1971/11/8 False True True True 10
159655 S96478806 3184873530 LocationWizard 2018/12/16 Male 1992/12/16 True False False False 10
159578 T96399026 3885746594 LocationWizard 2018/12/11 Male 1976/9/25 True True True True 10
159198 T96759858 45844100P0 LocationWizard 2018/12/27 Male 1974/3/29 True True True True 10
... ... ... ... ... ... ... ... ... ... ... ...
120 P03583767 NaN Store 2011/12/10 Male 1984/11/23 False True True True 10
117 W04520226 NaN Store 2011/12/25 Male 1980/11/16 False True True True 20
109 R03583770 NaN Store 2011/12/18 Male 1982/10/14 False True True True 10
108 U03583772 NaN Store 2012/1/1 Male 1981/10/16 False True True True 10
53 S04520222 NaN Store 2011/12/25 Male 1982/4/11 False True True True 10

1927 rows × 11 columns

Deriving new variables from DataFrame

  • map with a dict
  • map with a function(or a lambda expression)
In [29]:
member['RegisterSourceTypeDef'].unique()
Out[29]:
array(['Store', 'iOSApp', 'LocationWizard', 'Web', 'AndroidApp', nan],
      dtype=object)
In [30]:
register_source_dict = {
    'Store': 'Non mobile app',
    'LocationWizard': 'Non mobile app',
    'Web': 'Non mobile app',
    'AndroidApp': 'Mobile app',
    'iOSApp': 'Mobile app',
    np.NaN: 'Non mobile app'
}
member['RegisterSourceTypeDef'].map(register_source_dict).value_counts()
Out[30]:
Non mobile app    129817
Mobile app         30018
Name: RegisterSourceTypeDef, dtype: int64
In [31]:
def register_source_func(x):
    if x in ['AndroidApp', 'iOSApp']:
        return 'Mobile app'
    else:
        return 'Non mobile app'
member['RegisterSourceTypeDef'].map(register_source_func).value_counts()
Out[31]:
Non mobile app    129817
Mobile app         30018
Name: RegisterSourceTypeDef, dtype: int64

Summarizing DataFrame with built-in methods

In [32]:
order_main['TotalSalesAmount'].sum()
Out[32]:
1342622742

Summarizing and grouping DataFrame with built-in methods

In [33]:
order_main.groupby('PaymentType')['TotalSalesAmount'].sum()
Out[33]:
PaymentType
ATM                    91308
Cash              1093109830
CreditCardOnce      80850669
Family              75396620
LinePay              6214910
SevenEleven         86959405
Name: TotalSalesAmount, dtype: int64

Once general operations are done, we can move on to advanced DataFrame operations

  • Text to datetime type
  • Dealing with missing values
  • Merging(joining) dataframes

Text to datetime type

In [34]:
print(behavior['HitDateTime'][:100])
0     2018-05-16 23:14:31.500
1     2018-12-27 20:52:30.428
2     2018-06-07 06:17:15.701
3     2018-05-09 12:06:41.670
4     2018-03-12 07:34:54.457
               ...           
95    2018-11-27 20:07:52.028
96    2018-07-16 23:21:10.482
97    2018-07-17 23:15:20.457
98    2018-07-17 21:44:05.717
99    2018-06-12 08:27:47.882
Name: HitDateTime, Length: 100, dtype: object
In [35]:
print(pd.to_datetime(behavior['HitDateTime'][:100]))
0    2018-05-16 23:14:31.500
1    2018-12-27 20:52:30.428
2    2018-06-07 06:17:15.701
3    2018-05-09 12:06:41.670
4    2018-03-12 07:34:54.457
               ...          
95   2018-11-27 20:07:52.028
96   2018-07-16 23:21:10.482
97   2018-07-17 23:15:20.457
98   2018-07-17 21:44:05.717
99   2018-06-12 08:27:47.882
Name: HitDateTime, Length: 100, dtype: datetime64[ns]
In [36]:
print(member['RegisterDate'])
0          2011/12/4
1         2011/12/19
2         2011/12/19
3         2011/12/28
4         2011/12/19
             ...    
159830    2018/12/19
159831    2018/12/18
159832    2018/12/16
159833    2018/12/25
159834    2018/12/29
Name: RegisterDate, Length: 159835, dtype: object
In [37]:
print(pd.to_datetime(member['RegisterDate']))
0        2011-12-04
1        2011-12-19
2        2011-12-19
3        2011-12-28
4        2011-12-19
            ...    
159830   2018-12-19
159831   2018-12-18
159832   2018-12-16
159833   2018-12-25
159834   2018-12-29
Name: RegisterDate, Length: 159835, dtype: datetime64[ns]

Dealing with missing values

  • Using isnull and notnull to check if np.NaN exists
  • Using dropna to drop rows with np.NaN
  • Using fillna to fill np.NaN with specific values
In [38]:
print(member.shape[0])
print("===")
print(member['RegisterSourceTypeDef'].isnull().sum())
print(member['RegisterSourceTypeDef'].notnull().sum())
print("===")
print(member.dropna().shape[0])
print("===")
print(member['RegisterSourceTypeDef'].fillna('Unknown').isnull().sum())
print(member['RegisterSourceTypeDef'].fillna('Unknown').notnull().sum())
159835
===
14
159821
===
94067
===
0
159835

Merging(joining) dataframes

  • merge function on column
  • join method on index
In [39]:
# Retrieve the order_main of max TotalSalesAmount
max_ttl_sales_amt = order_main['TotalSalesAmount'].max()
max_order_main = order_main[order_main.TotalSalesAmount == max_ttl_sales_amt]
max_order_main
Out[39]:
UUID TransactionNum TradesDate ChannelType ChannelDetail PaymentType ShippingType BeforeMemberTierLevel AfterMemberTierLevel TsCount Qty TotalSalesAmount TotalPrice TotalDiscount TotalPromotionDiscount TotalCouponDiscount TotalLoyaltyDiscount Status
413632 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019/4/14 Pos C019 Cash Store 0 0 162 162 255960 255960 0 0 0 0 Finish
In [40]:
# Retrieve the order_slave of max TotalSalesAmount
max_transaction_num = max_order_main['TransactionNum'].values[0]
max_order_slave = order_slave[order_slave['TransactionNum'] == max_transaction_num]
max_order_slave
Out[40]:
UUID TransactionNum TradesDateTime ChannelType ChannelDetail PaymentType ShippingType OuterProductSkuCode ProductSkuCode SalePageCode Qty TotalSalesAmount TotalPrice TotalDiscount Status
1087514 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN 1 255960 255960 0 Finish
1087515 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN 1 255960 255960 0 Finish
1087516 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN 1 255960 255960 0 Finish
1087517 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN 1 255960 255960 0 Finish
1087519 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN 1 255960 255960 0 Finish
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1087679 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN 1 255960 255960 0 Finish
1087680 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN 1 255960 255960 0 Finish
1087681 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN 1 255960 255960 0 Finish
1087682 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN 1 255960 255960 0 Finish
1087683 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN 1 255960 255960 0 Finish

162 rows × 15 columns

We can use either UUID or TransactionNum as our foreign keys

In [41]:
pd.merge(max_order_slave, max_order_main, left_on='UUID', right_on='UUID', suffixes=('_slave', '_main'))
Out[41]:
UUID TransactionNum_slave TradesDateTime ChannelType_slave ChannelDetail_slave PaymentType_slave ShippingType_slave OuterProductSkuCode ProductSkuCode SalePageCode ... AfterMemberTierLevel TsCount Qty_main TotalSalesAmount_main TotalPrice_main TotalDiscount_main TotalPromotionDiscount TotalCouponDiscount TotalLoyaltyDiscount Status_main
0 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN ... 0 162 162 255960 255960 0 0 0 0 Finish
1 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN ... 0 162 162 255960 255960 0 0 0 0 Finish
2 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN ... 0 162 162 255960 255960 0 0 0 0 Finish
3 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN ... 0 162 162 255960 255960 0 0 0 0 Finish
4 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN ... 0 162 162 255960 255960 0 0 0 0 Finish
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
157 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN ... 0 162 162 255960 255960 0 0 0 0 Finish
158 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN ... 0 162 162 255960 255960 0 0 0 0 Finish
159 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN ... 0 162 162 255960 255960 0 0 0 0 Finish
160 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN ... 0 162 162 255960 255960 0 0 0 0 Finish
161 R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN ... 0 162 162 255960 255960 0 0 0 0 Finish

162 rows × 32 columns

If we want to adapt the join method, set FK to index first

In [42]:
max_order_slave = max_order_slave.set_index('UUID')
max_order_main = max_order_main.set_index('UUID')
max_order_slave.join(max_order_main, lsuffix='_slave', rsuffix='_main')
Out[42]:
TransactionNum_slave TradesDateTime ChannelType_slave ChannelDetail_slave PaymentType_slave ShippingType_slave OuterProductSkuCode ProductSkuCode SalePageCode Qty_slave ... AfterMemberTierLevel TsCount Qty_main TotalSalesAmount_main TotalPrice_main TotalDiscount_main TotalPromotionDiscount TotalCouponDiscount TotalLoyaltyDiscount Status_main
UUID
R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN 1 ... 0 162 162 255960 255960 0 0 0 0 Finish
R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN 1 ... 0 162 162 255960 255960 0 0 0 0 Finish
R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN 1 ... 0 162 162 255960 255960 0 0 0 0 Finish
R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN 1 ... 0 162 162 255960 255960 0 0 0 0 Finish
R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990901 NaN NaN 1 ... 0 162 162 255960 255960 0 0 0 0 Finish
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN 1 ... 0 162 162 255960 255960 0 0 0 0 Finish
R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN 1 ... 0 162 162 255960 255960 0 0 0 0 Finish
R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN 1 ... 0 162 162 255960 255960 0 0 0 0 Finish
R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN 1 ... 0 162 162 255960 255960 0 0 0 0 Finish
R04234145 zK6XGNghLYS3ZuQloS4ocxEXeW8= 2019-04-14 17:58:33 Pos C019 NaN NaN 509120990907 NaN NaN 1 ... 0 162 162 255960 255960 0 0 0 0 Finish

162 rows × 31 columns

Practices for pandas operations

  • Import 4 91APP files into your own Python environment
  • Fill np.nan with order_slave['PaymentType'] and order_slave['ShippingType'] for the max TotalSalesAmount order(TransactionNum: zK6XGNghLYS3ZuQloS4ocxEXeW8=)

Practices for pandas operations: Solutions

In [43]:
# Import 4 91APP files into your own Python environment
# Refer to previous slides
In [44]:
# Fill np.nan with order_slave['PaymentType'] and order_slave['ShippingType'] for the max TotalSalesAmount order(TransactionNum: zK6XGNghLYS3ZuQloS4ocxEXeW8=)
# Retrieve the order_main of max TotalSalesAmount
max_ttl_sales_amt = order_main['TotalSalesAmount'].max()
max_order_main = order_main[order_main.TotalSalesAmount == max_ttl_sales_amt]
# Retrieve the order_slave of max TotalSalesAmount
max_transaction_num = max_order_main['TransactionNum'].values[0]
max_order_slave = order_slave[order_slave['TransactionNum'] == max_transaction_num]
# Retrieve imputed values
impute_payment_type = max_order_main['PaymentType'].values[0]
impute_shipping_type = max_order_main['ShippingType'].values[0]
In [45]:
# Before fillna
print(max_order_slave['PaymentType'])
print(max_order_slave['ShippingType'])
1087514    NaN
1087515    NaN
1087516    NaN
1087517    NaN
1087519    NaN
          ... 
1087679    NaN
1087680    NaN
1087681    NaN
1087682    NaN
1087683    NaN
Name: PaymentType, Length: 162, dtype: object
1087514    NaN
1087515    NaN
1087516    NaN
1087517    NaN
1087519    NaN
          ... 
1087679    NaN
1087680    NaN
1087681    NaN
1087682    NaN
1087683    NaN
Name: ShippingType, Length: 162, dtype: object
In [46]:
# After fillna
print(max_order_slave['PaymentType'].fillna(impute_payment_type))
print(max_order_slave['ShippingType'].fillna(impute_shipping_type))
1087514    Cash
1087515    Cash
1087516    Cash
1087517    Cash
1087519    Cash
           ... 
1087679    Cash
1087680    Cash
1087681    Cash
1087682    Cash
1087683    Cash
Name: PaymentType, Length: 162, dtype: object
1087514    Store
1087515    Store
1087516    Store
1087517    Store
1087519    Store
           ...  
1087679    Store
1087680    Store
1087681    Store
1087682    Store
1087683    Store
Name: ShippingType, Length: 162, dtype: object

Data Visualization with Python

TL; DR

We briefly introduce visualization and how to use popular packages for visualization in Python with 91APP dataset.

What is data visualization?

Visualization is a tool that addresses the gap between data, function, formula, and viewers. Effective visualizations transform abstract collections of numbers into shapes that viewers quickly grasp and understand.

Source: Data Visualization with JavaScript

Why data visualization: using as a tool that addresses the gap between data and viewers

what are $x$ and $f$?

In [48]:
print(x)
print(f)
[ 0.          0.25645654  0.51291309  0.76936963  1.02582617  1.28228272
  1.53873926  1.7951958   2.05165235  2.30810889  2.56456543  2.82102197
  3.07747852  3.33393506  3.5903916   3.84684815  4.10330469  4.35976123
  4.61621778  4.87267432  5.12913086  5.38558741  5.64204395  5.89850049
  6.15495704  6.41141358  6.66787012  6.92432667  7.18078321  7.43723975
  7.69369629  7.95015284  8.20660938  8.46306592  8.71952247  8.97597901
  9.23243555  9.4888921   9.74534864 10.00180518 10.25826173 10.51471827
 10.77117481 11.02763136 11.2840879  11.54054444 11.79700098 12.05345753
 12.30991407 12.56637061]
[ 0.00000000e+00  2.53654584e-01  4.90717552e-01  6.95682551e-01
  8.55142763e-01  9.58667853e-01  9.99486216e-01  9.74927912e-01
  8.86599306e-01  7.40277997e-01  5.45534901e-01  3.15108218e-01
  6.40702200e-02 -1.91158629e-01 -4.33883739e-01 -6.48228395e-01
 -8.20172255e-01 -9.38468422e-01 -9.95379113e-01 -9.87181783e-01
 -9.14412623e-01 -7.81831482e-01 -5.98110530e-01 -3.75267005e-01
 -1.27877162e-01  1.27877162e-01  3.75267005e-01  5.98110530e-01
  7.81831482e-01  9.14412623e-01  9.87181783e-01  9.95379113e-01
  9.38468422e-01  8.20172255e-01  6.48228395e-01  4.33883739e-01
  1.91158629e-01 -6.40702200e-02 -3.15108218e-01 -5.45534901e-01
 -7.40277997e-01 -8.86599306e-01 -9.74927912e-01 -9.99486216e-01
 -9.58667853e-01 -8.55142763e-01 -6.95682551e-01 -4.90717552e-01
 -2.53654584e-01 -4.89858720e-16]

Gap addressed: data and viewers

$$f(x)=sin(x), \text{where } 0 \leq x \leq 4\pi$$
In [50]:
plot_sinx(x, f)

Why data visualization: using as a tool that addresses the gap between function, formula and viewers

$$S(x) = \frac{1}{1 + e^{-x}}$$

Gap addressed: function, formula and viewers

In [52]:
plot_sigmoid(x, S)

Visualization also plays a key role in machine learning

Machine learning engineer uses line plot to visualize cost function

In [54]:
plot_logistic_cost_function(x)

Machine learning engineer uses surface plot or contour plot to visualize gradients

Imgur

Source: Coursera | Machine Learning

Machine learning engineer also uses contour plot to visualize gradients

Imgur

Source: Coursera | Machine Learning

In [56]:
plot_multiple_decision_boundary()

Minard's map of Napoleon's disastrous Russian campaign of 1812

In 1812, Napoleon marched to Moscow. 98% of his soldiers died. The simple but fascinating temperature line below the viz shows how cold ultimately defeated Napoleon’s army.

Imgur

Hans Rosling's 200 Countries, 200 Years, 4 Minutes

A 2007 TED talk for the Swedish scientist shared his passion with the world. It shows the relationship between wealth and health.

Imgur

A quick look back to the data science pipeline

Imgur

Source: R for Data Science

Visualization plays heavy roles in

  • Visualization
  • Modeling
  • Communicattion
  • pandas
  • seaborn
  • matplotlib
  • plotly

Difference between these libraries

  • Plotting with DataFrame or Series
    • pandas
    • seaborn
  • Plotting with list, ndarray, or Series
    • matplotlib
  • Plotting with interactivity
    • plotly

Importing these libraries

In [57]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Getting data for a quickstart on these libraries

In [58]:
order_slave_dtypes = {
    'PaymentType': 'str',
    'ShippingType': 'str'
}
behavior_dtypes = {
    'CategoryId': 'str',
    'SearchKeyWord': 'str',
    'TransactionNum': 'str'
}
member = pd.read_csv('MemberDataForNTU.csv')
order_main = pd.read_csv('OrderDataForNTU.csv')
order_slave = pd.read_csv('OrderSlaveDataForNTU.txt', dtype=order_slave_dtypes)
behavior = pd.read_csv('BehaviorDataForNTU.csv', dtype=behavior_dtypes)

Choose appropriate plots according to analytical purposes

  • Bar for ranking and comparison
  • Scatter for correlation
  • Histogram for distribution
  • Line for trend
  • Box for distribution across different categories

How to plot with pandas

DataFrame class has a .plot() method with several key parameters.

Make a groupby summary with pandas

In [59]:
order_main.groupby('PaymentType')['TotalSalesAmount'].sum().sort_values()
Out[59]:
PaymentType
ATM                    91308
LinePay              6214910
Family              75396620
CreditCardOnce      80850669
SevenEleven         86959405
Cash              1093109830
Name: TotalSalesAmount, dtype: int64

Make a barh with pandas following groupby summary

In [60]:
sale_amount_by_payment_type = order_main.groupby('PaymentType')['TotalSalesAmount'].sum().sort_values()
sale_amount_by_payment_type.plot(kind='barh')
plt.show()

Make another groupby summary with pandas

In [61]:
order_main['TradesDate'] = pd.to_datetime(order_main['TradesDate'])
order_main.groupby('TradesDate')['TotalSalesAmount'].sum()
Out[61]:
TradesDate
2016-01-01     920950
2016-01-02     966747
2016-01-03     704524
2016-01-04     660950
2016-01-05     717040
               ...   
2019-04-27    1575247
2019-04-28    1285845
2019-04-29    1292115
2019-04-30    2225314
2019-05-01      26440
Name: TotalSalesAmount, Length: 1217, dtype: int64

Make a line with pandas following groupby summary

In [62]:
sale_amount_by_trades_date = order_main.groupby('TradesDate')['TotalSalesAmount'].sum()
sale_amount_by_trades_date.plot(kind='line')
plt.show()

How to plot with seaborn

Using different plotting functions

  • barplot function for a bar
  • lineplot function for a line

Transform groupby summary to a DataFrame with pandas

In [63]:
sale_amount_by_payment_type_df = pd.DataFrame(sale_amount_by_payment_type).reset_index().sort_values('TotalSalesAmount', ascending=False)
sale_amount_by_payment_type_df
Out[63]:
PaymentType TotalSalesAmount
5 Cash 1093109830
4 SevenEleven 86959405
3 CreditCardOnce 80850669
2 Family 75396620
1 LinePay 6214910
0 ATM 91308

barplot function for a bar

In [64]:
sns.barplot(x='TotalSalesAmount', y='PaymentType', data=sale_amount_by_payment_type_df)
plt.show()

Transform groupby summary to another DataFrame with pandas

In [65]:
sale_amount_by_trades_date_shipping_type = order_main.groupby(['TradesDate', 'ShippingType'])['TotalSalesAmount'].sum()
sale_amount_by_trades_date_shipping_type_df = pd.DataFrame(sale_amount_by_trades_date_shipping_type).reset_index()
sale_amount_by_trades_date_shipping_type_df
Out[65]:
TradesDate ShippingType TotalSalesAmount
0 2016-01-01 Family 3740
1 2016-01-01 Home 18370
2 2016-01-01 SevenEleven 18780
3 2016-01-01 Store 880060
4 2016-01-02 Family 18260
... ... ... ...
7147 2019-04-30 Store 1682940
7148 2019-05-01 Family 4830
7149 2019-05-01 Home 16680
7150 2019-05-01 LocationPickup 2660
7151 2019-05-01 SevenEleven 2270

7152 rows × 3 columns

lineplot function for a line

In [66]:
sns.lineplot(x='TradesDate', y='TotalSalesAmount', hue='ShippingType', data=sale_amount_by_trades_date_shipping_type_df)
plt.show()

How to plot with matplotlib

  • Arranging data to iterables: list, ndarray or Series
  • Call figure function for a "figure" object
  • Call axes function for an "axis" object
  • Call plotting methods from an axis object
  • Add elements
  • Call show function for display

Take a bar for example: figure and axes first

In [67]:
fig = plt.figure()
axes = plt.axes()

Take a bar for example: call plotting methods and show plot

In [68]:
fig = plt.figure()
axes = plt.axes()
axes.barh(sale_amount_by_payment_type.index, sale_amount_by_payment_type.values)
plt.show()

Take a line for another example: figure and axes first

In [69]:
fig = plt.figure()
axes = plt.axes()

Take a line for another example: call plotting methods and show plot

In [70]:
fig = plt.figure()
axes = plt.axes()
x = sale_amount_by_trades_date.index
y = sale_amount_by_trades_date.values
axes.plot(x, y)
plt.show()

The line seems a bit too narrow for us...maybe it's time to add interactivity on it

In [71]:
import plotly.graph_objs as go
In [72]:
# Create df
df = pd.DataFrame(sale_amount_by_trades_date).reset_index()
# Create figure
layout = go.Layout(
    autosize=False,
    width=720,
    height=480,
    margin=go.layout.Margin(
        l=50,
        r=50,
        b=100,
        t=100,
        pad=4
    ),
    bargroupgap=0.3
)
fig = go.Figure(layout=layout)
In [73]:
fig.add_trace(
    go.Scatter(x=list(df.TradesDate), y=list(df.TotalSalesAmount)))
# Set title
fig.update_layout(
    title_text="Time series with range slider and selectors"
)
In [74]:
# Add range slider
fig.update_layout(
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label="1m",
                     step="month",
                     stepmode="backward"),
                dict(count=6,
                     label="6m",
                     step="month",
                     stepmode="backward"),
                dict(count=1,
                     label="YTD",
                     step="year",
                     stepmode="todate"),
                dict(count=1,
                     label="1y",
                     step="year",
                     stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)
In [75]:
fig.show()